StartR Workshop
University of Konstanz
November 24, 2024
The correlation package is an easystats package focused on correlation analysis.
A single correlation can be determined with the cor_test() function.
We can plot the correlation using the plot() function.
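The fifa data used below is not bundled with R, so as a self-contained sketch with the built-in mtcars data: base R's cor.test() performs the same single-pair test that cor_test() wraps.

```r
# Correlation between fuel economy and weight (mtcars as stand-in data)
res <- cor.test(mtcars$mpg, mtcars$wt)
res$estimate  # Pearson's r, about -0.87
res$p.value   # far below .001
```

cor_test() from the correlation package returns the same statistics in a tidier, easystats-style format.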
We can compute correlations between multiple variables with the correlation() function.
# Select variables of interest
variables <- select(fifa, Overall, Potential, Value, Wage)
# Compute correlations
correlation(variables)
# Correlation Matrix (pearson-method)
Parameter1 | Parameter2 | r | 95% CI | t(17658) | p
--------------------------------------------------------------------
Overall | Potential | 0.71 | [0.70, 0.71] | 132.69 | < .001***
Overall | Value | 0.56 | [0.55, 0.57] | 90.81 | < .001***
Overall | Wage | 0.60 | [0.59, 0.61] | 99.55 | < .001***
Potential | Value | 0.51 | [0.50, 0.52] | 79.02 | < .001***
Potential | Wage | 0.48 | [0.47, 0.49] | 72.58 | < .001***
Value | Wage | 0.81 | [0.81, 0.82] | 183.96 | < .001***
p-value adjustment method: Holm (1979)
Observations: 17660
We can use the summary() function to get a correlation matrix.
# Compute and save correlations
r <- correlation(variables)
# Generate correlation matrix
summary(r)
# Correlation Matrix (pearson-method)
Parameter | Wage | Value | Potential
-----------------------------------------
Overall | 0.60*** | 0.56*** | 0.71***
Potential | 0.48*** | 0.51*** |
Value | 0.81*** | |
p-value adjustment method: Holm (1979)
We can use the plot() function to plot the correlation matrix.
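The plot() method for correlation objects typically requires the see package; as a base-R sketch on the built-in mtcars data, pairs() visualises every variable pair behind such a matrix.

```r
# Scatterplot matrix for a few mtcars variables (stand-in data)
vars <- mtcars[, c("mpg", "disp", "hp", "wt")]
pairs(vars)
# The underlying correlation matrix, rounded for readability
round(cor(vars), 2)
```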
There are numerous regression models available in R. We will focus on the base R implementation of the linear regression model.
There are three aspects of regression we will cover:
- Model estimation with the lm() function in base R
- Model parameters with the parameters package
- Model performance with the performance package
The parameters and performance packages are part of the easystats package collection.
We can use the lm() function to estimate a simple linear regression model.
# estimate regression model with a single predictor: dv ~ iv
model <- lm(Wage ~ Overall, data = fifa)
# get the standard output
summary(model)
Call:
lm(formula = Wage ~ Overall, data = fifa)
Residuals:
Min 1Q Median 3Q Max
-39708 -7903 -2542 5097 399598
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -88622.18 980.27 -90.41 <2e-16 ***
Overall 1527.74 15.35 99.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16390 on 17658 degrees of freedom
Multiple R-squared: 0.3595, Adjusted R-squared: 0.3595
F-statistic: 9911 on 1 and 17658 DF, p-value: < 2.2e-16
We can use the parameters() and performance() functions to evaluate the model.
parameters(model)
Parameter | Coefficient | SE | 95% CI | t(17658) | p
-------------------------------------------------------------------------------
(Intercept) | -88622.18 | 980.27 | [-90543.60, -86700.76] | -90.41 | < .001
Overall | 1527.74 | 15.35 | [ 1497.66, 1557.82] | 99.55 | < .001
performance(model)
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
-----------------------------------------------------------------------------
3.929e+05 | 3.929e+05 | 3.929e+05 | 0.359 | 0.359 | 16387.478 | 16388.406
We can use the check_model() function to check whether the assumptions of a linear regression are met.
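check_model() belongs to the performance package; base R's plot() method for lm objects produces a comparable set of diagnostics. A minimal sketch, assuming the built-in mtcars data as a stand-in for fifa:

```r
# Fit a simple model on built-in data
model_mt <- lm(mpg ~ wt, data = mtcars)
# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model_mt)
```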
We can also use the lm() function to estimate a multiple regression model.
# estimate regression model with two predictors: dv ~ iv1 + iv2
model2 <- lm(Wage ~ Overall + Value, data = fifa)
summary(model2)
Call:
lm(formula = Wage ~ Overall + Value, data = fifa)
Residuals:
Min 1Q Median 3Q Max
-132632 -3555 -908 2282 316273
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.045e+04 8.083e+02 -37.67 <2e-16 ***
Overall 5.315e+02 1.300e+01 40.89 <2e-16 ***
Value 1.810e-03 1.332e-05 135.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11460 on 17657 degrees of freedom
Multiple R-squared: 0.6868, Adjusted R-squared: 0.6867
F-statistic: 1.936e+04 on 2 and 17657 DF, p-value: < 2.2e-16
We can use the compare_performance() function to compare the performance of two or more models.
compare_performance(model, model2)
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE
----------------------------------------------------------------------------------
model | lm | 3.9e+05 (<.001) | 3.9e+05 (<.001) | 0.359 | 0.359 | 16387.478
model2 | lm | 3.8e+05 (>.999) | 3.8e+05 (>.999) | 0.687 | 0.687 | 11459.711
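A base-R analogue of the same comparison, assuming only built-in data: AIC() accepts several fitted models at once, and lower values indicate better fit, which is the core of what compare_performance() reports.

```r
m1 <- lm(mpg ~ wt, data = mtcars)       # single predictor
m2 <- lm(mpg ~ wt + hp, data = mtcars)  # two predictors
AIC(m1, m2)  # m2 has the lower AIC here, i.e. the better fit
```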
t.test() function
The t.test() function has two basic forms:
| Test | Wide Format | Long Format |
|---|---|---|
| One-sample | t.test(x) | |
| Independent | t.test(x, y) | t.test(y ~ x, data) |
| Dependent | t.test(x, y, paired = TRUE) | t.test(y ~ x, data, paired = TRUE) |
The function has various additional options, e.g.:
- var.equal for equal variances (TRUE or FALSE)
- alternative for one-sided tests ("less" or "greater")
- mu for the null hypothesis value (default is 0)
- conf.level for the confidence interval (default is 0.95)
t-tests can be performed with the t.test() function in base R. If only one variable is provided, a one-sample t-test is performed.
# Compare RSA_Pre between female and male players (independent samples)
t.test(RSA_Pre ~ Sex, data = hiit)
Welch Two Sample t-test
data: RSA_Pre by Sex
t = -8.8367, df = 17.154, p-value = 8.54e-08
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-1.1327500 -0.6963409
sample estimates:
mean in group female mean in group male
4.561818 5.476364
# Compare pre- and post-training performance within individuals
t.test(hiit$RSA_Pre, hiit$RSA_Post1, paired = TRUE)
Paired t-test
data: hiit$RSA_Pre and hiit$RSA_Post1
t = 6.9213, df = 21, p-value = 7.733e-07
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
0.1255983 0.2334927
sample estimates:
mean difference
0.1795455
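The one-sample form from the table above is not demonstrated with the hiit data; a self-contained sketch using the built-in sleep data tests whether the mean sleep increase differs from zero:

```r
# One-sample t-test: is the mean of `extra` different from mu = 0?
t.test(sleep$extra, mu = 0)
```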
afex (Analysis of Factorial EXperiments) is dedicated to analysis of variance.
emmeans (Estimated Marginal MEANS) is dedicated to post-hoc tests.
aov_4() function
The aov_4() function is part of the afex package and can be used to estimate a variety of ANOVA models.
Generic format: aov_4(dv ~ iv_b + (iv_w | id), data)
- dv is the dependent variable
- iv_b are the between-subject predictor variables
- iv_w are the within-subject predictor variables
- id is the subject identifier variable
- data is the data frame
emmeans() function
The emmeans() function is part of the emmeans package. It has the following generic format:
Generic format: emmeans(model, ~ iv)
- model is the model object
- iv are the predictor variables
Between-subject ANOVA
# Between-subject predictor: Sex
data <- dplyr::filter(hiit_long, Measure == "RSA" & Time == "Pre")
afex::aov_4(Value ~ Sex + (1 | ID), data)
Anova Table (Type 3 tests)
Response: Value
Effect df MSE F ges p.value
1 Sex 1, 20 0.06 78.09 *** .796 <.001
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
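For readers without afex installed, a base-R sketch of the same one-way between-subject design, using the built-in PlantGrowth data as a stand-in (aov_4() additionally reports effect sizes and Type 3 tests):

```r
# One-way between-subject ANOVA: plant weight by treatment group
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)
```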
Within-subject ANOVA
# Within-subject predictor: Time
data <- dplyr::filter(hiit_long, Measure == "RSA")
afex::aov_4(Value ~ 1 + (Time | ID), data)
Anova Table (Type 3 tests)
Response: Value
Effect df MSE F ges p.value
1 Time 1.30, 27.20 0.04 7.16 ** .020 .008
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
Sphericity correction method: GG
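A base-R sketch of the within-subject idea, using the built-in sleep data (each subject measured under both conditions): the Error(ID) term tells aov() that observations are blocked by subject.

```r
# Repeated-measures ANOVA via an error stratum: group is the
# within-subject factor, ID identifies the subject
fit <- aov(extra ~ group + Error(ID), data = sleep)
summary(fit)
```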
# Within-subject predictor: Time
data <- dplyr::filter(hiit_long, Measure == "RSA")
model <- afex::aov_4(Value ~ 1 + (Time | ID), data)
# Estimate the marginal means
emm <- emmeans::emmeans(model, ~ Time)
emm
Time emmean SE df lower.CL upper.CL
Pre 5.02 0.112 21 4.79 5.25
Post1 4.84 0.118 21 4.59 5.09
Post2 4.97 0.119 21 4.73 5.22
Confidence level used: 0.95
# Pairwise comparisons between time points
pairs(emm)
contrast estimate SE df t.ratio p.value
Pre - Post1 0.1795 0.0259 21 6.921 <.0001
Pre - Post2 0.0459 0.0601 21 0.764 0.7289
Post1 - Post2 -0.1336 0.0548 21 -2.440 0.0589
P value adjustment: tukey method for comparing a family of 3 estimates
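Base R offers an analogue of these Tukey-adjusted post-hoc comparisons: TukeyHSD() on an aov fit. A sketch on the built-in warpbreaks data:

```r
# Tukey-adjusted pairwise comparisons between the three tension levels
fit <- aov(breaks ~ tension, data = warpbreaks)
TukeyHSD(fit)
```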
# Between-subject predictor: Sex, Within-subject predictor: Time
data <- dplyr::filter(hiit_long, Measure == "RSA")
afex::aov_4(Value ~ Sex + (Time | ID), data)
Anova Table (Type 3 tests)
Response: Value
Effect df MSE F ges p.value
1 Sex 1, 20 0.16 90.01 *** .770 <.001
2 Time 1.26, 25.24 0.04 6.97 ** .081 .010
3 Sex:Time 1.26, 25.24 0.04 0.43 .005 .566
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
Sphericity correction method: GG
# Between-subject predictor: Sex, Within-subject predictor: Time
data <- dplyr::filter(hiit_long, Measure == "RSA")
model <- afex::aov_4(Value ~ Sex + (Time | ID), data)
# Estimate the marginal means
emm <- emmeans::emmeans(model, ~ Sex * Time)
# Estimate simple effects
emmeans::joint_tests(emm, by = "Time")
Time = Pre:
model term df1 df2 F.ratio p.value
Sex 1 20 78.087 <.0001
Time = Post1:
model term df1 df2 F.ratio p.value
Sex 1 20 99.553 <.0001
Time = Post2:
model term df1 df2 F.ratio p.value
Sex 1 20 43.946 <.0001